Goto

Collaborating Authors

 multilingual training



Robust Optimization for Multilingual Translation with Imbalanced Data

Neural Information Processing Systems

Multilingual models are parameter-efficient and especially effective in improving low-resource languages by leveraging crosslingual transfer. Despite recent advance in massive multilingual translation with ever-growing model and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, poses optimization tension between high resource and low resource languages where the found multilingual solution is often sub-optimal for low resources. We show that common training method which upsamples low resources can not robustly optimize population loss with risks of either underfitting high resource languages or overfitting low resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and as a result demonstrated consistent gains on low resources ($+0.8$ to $+2.2$ BLEU) without hurting high resources. In addition, CATS is robust to overparameterization and large batch size training, making it a promising training method for massive multilingual models that truly improve low resource languages.



Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

Ge, Peixuan, Su, Tongkun, Lv, Faqin, Zhao, Baoliang, Zhang, Peng, Wong, Chi Hong, Yao, Liang, Sun, Yu, Wang, Zenan, Wong, Pak Kin, Hu, Ying

arXiv.org Artificial Intelligence

Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2\% in BLEU scores, approximately 3\% in ROUGE-L, and about 15\% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.


MultiConAD: A Unified Multilingual Conversational Dataset for Early Alzheimer's Detection

Shakeri, Arezo, Farmanbar, Mina, Balog, Krisztian

arXiv.org Artificial Intelligence

Dementia is a progressive cognitive syndrome with Alzheimer's disease (AD) as the leading cause. Conversation-based AD detection offers a cost-effective alternative to clinical methods, as language dysfunction is an early biomarker of AD. However, most prior research has framed AD detection as a binary classification problem, limiting the ability to identify Mild Cognitive Impairment (MCI)-a crucial stage for early intervention. Also, studies primarily rely on single-language datasets, mainly in English, restricting cross-language generalizability. To address this gap, we make three key contributions. First, we introduce a novel, multilingual dataset for AD detection by unifying 16 publicly available dementia-related conversational datasets. This corpus spans English, Spanish, Chinese, and Greek and incorporates both audio and text data derived from a variety of cognitive assessment tasks. Second, we perform finer-grained classification, including MCI, and evaluate various classifiers using sparse and dense text representations. Third, we conduct experiments in monolingual and multilingual settings, finding that some languages benefit from multilingual training while others perform better independently. This study highlights the challenges in multilingual AD detection and enables future research on both language-specific approaches and techniques aimed at improving model generalization and robustness.


Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation

Zheng, Tong, Wen, Yan, Bao, Huiwen, Guo, Junfeng, Huang, Heng

arXiv.org Artificial Intelligence

The emergence of Large Language Models (LLMs) has advanced the multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue via scaling up training and computation budget, which raises a critical question: Is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze the linguistic conflicts and synergy, the underlying mechanism of CoM during post-training phase. We identify an asymmetric phenomenon in linguistic conflicts and synergy: the dominance of conflicts and synergy varies in different translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these new insights, we propose a direction-aware training approach, combined with group-wise model merging, to address asymmetry in linguistic conflicts and synergy explicitly. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain-trained only with multilingual pre-training-achieving comparable performance to XALMA-13B (only SFT) while using only 20B pretraining tokens and 17B parameters-5.5x fewer pretraining-tokens and 1.7x fewer model size-with just 0.85 COMET drop on Flores-200 testsets of 50 languages.


Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias

Sahili, Zahraa Al, Patras, Ioannis, Purver, Matthew

arXiv.org Artificial Intelligence

As large scale vision language models become increasingly central to modern AI applications, understanding and mitigating social biases in these systems has never been more critical. We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open source variants. In particular, we systematically evaluate models trained on varying dataset scales and architectures, as well as multilingual versions encompassing English along with Persian, Turkish, and Finnish,languages with minimal gender marking. To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms rooted in the psychological constructs of communion and agency, and demographic labeling bias using both the FairFace and PATA datasets. Our findings reveal three key insights. First, while larger training datasets can mitigate some biases, they may also introduce or amplify others when the data composition is imbalanced. Second, although increasing model size generally improves performance, it does not consistently reduce bias and can, in certain cases, exacerbate it. Finally, while multilingual training broadens linguistic coverage, it does not inherently neutralize bias and can transfer or intensify inequities across languages. Taken together, these results highlight the necessity of inclusive, carefully curated training data to foster fairness rather than relying solely on model scaling or language expansion. We provide a systematic evaluation for vision language bias across diverse demographics, underscoring the urgent need for intentional bias mitigation strategies in next-generation AI systems.


Robust Optimization for Multilingual Translation with Imbalanced Data

Neural Information Processing Systems

Multilingual models are parameter-efficient and especially effective in improving low-resource languages by leveraging crosslingual transfer. Despite recent advance in massive multilingual translation with ever-growing model and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, poses optimization tension between high resource and low resource languages where the found multilingual solution is often sub-optimal for low resources. We show that common training method which upsamples low resources can not robustly optimize population loss with risks of either underfitting high resource languages or overfitting low resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages.


Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Kim, Minu, Jang, Kangwook, Kim, Hoirin

arXiv.org Artificial Intelligence

This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.


Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People

Wang, Xidong, Chen, Nuo, Chen, Junyin, Hu, Yan, Wang, Yidong, Wu, Xiangbo, Gao, Anningzhe, Wan, Xiang, Li, Haizhou, Wang, Benyou

arXiv.org Artificial Intelligence

Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion.